
Voicenote Storage: Optimize audio handling #815

Open
Prajna1999 wants to merge 6 commits into main from feat/b64-to-s3

Conversation

Collaborator

@Prajna1999 Prajna1999 commented May 6, 2026

Summary

Target issue is #804
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, make sure it is covered by test cases.

Notes

Please add any other information the reviewer might need here.

Summary by CodeRabbit

  • New Features

    • Audio inputs now accept URLs in addition to base64-encoded content
    • Audio outputs are automatically stored in cloud storage with temporary signed URLs for downloads
    • Automatic MIME type detection ensures correct file format handling for audio assets
  • Improvements

    • Enhanced audio payload management with cloud storage integration for improved scalability

Review Change Stack


coderabbitai Bot commented May 6, 2026

📝 Walkthrough

Walkthrough

This PR extends the LLM audio pipeline to persist large audio payloads to S3 and serve them via presigned URLs. Audio inputs and outputs are uploaded on-demand, with URI-only representations stored in the database while clients receive short-lived download links.

Changes

Large Audio Asset Storage and Presigned URL Serving Pipeline

  • Audio Schema Extensions (backend/app/models/llm/request.py): AudioContent.format now accepts "url" in addition to "base64", the value field documentation covers both base64-encoded content and public URLs, and a new optional uri field stores presigned object-storage URLs.
  • S3 Upload Infrastructure (backend/app/core/storage_utils.py): New _MIME_TO_EXT mapping and upload_audio_bytes_to_s3 function format filenames from call_id and MIME type (defaulting to wav) and upload bytes via upload_to_object_store, returning the S3 URI or None on failure.
  • CRUD Helpers and Audio Resolution Utilities (backend/app/crud/llm.py, backend/app/utils.py): serialize_input handles URL audio as a dedicated JSON payload; update_llm_call_response skips size calculation for non-base64 audio; update_llm_call_input persists S3 URIs to the database; download_audio_bytes fetches remote content with timeout handling; resolve_audio_url writes downloaded bytes to temp files; resolve_input routes based on the format discriminator.
  • STT Audio Input Persistence (backend/app/services/llm/jobs.py, lines 572–625): When an LLM call receives STT audio (base64 or URL), the job downloads/decodes the bytes, uploads them to S3, and updates the persisted LLM call input to a URI-only JSON record, continuing execution on failure.
  • TTS Audio Output Persistence and Presigned URLs (backend/app/services/llm/jobs.py, lines 723–946): When a provider returns TTS audio in base64, the job uploads the decoded bytes to S3, sets the in-memory content.uri to the S3 path, and stores a URI-only representation in the database. After successful execution, the job generates presigned URLs (3600s expiry) to replace raw S3 URIs in the response, clearing the URI on presign failures.
  • Chain Executor Audio Response Finalization (backend/app/services/llm/chain/executor.py): _resolve_presigned_url replaces AudioOutput S3 URIs with presigned URLs and is wired into chain teardown and intermediate callbacks so responses contain short-lived signed URLs instead of raw object-store paths.

Sequence Diagram

sequenceDiagram
  participant Client
  participant Jobs
  participant ResolveInput
  participant S3Upload
  participant ChainExecutor
  
  Client->>Jobs: AudioInput (base64 or url)
  Jobs->>ResolveInput: resolve format
  ResolveInput-->>Jobs: audio bytes
  Jobs->>S3Upload: upload bytes
  S3Upload-->>Jobs: s3:// URI
  Jobs->>Jobs: update DB input to URI-only
  Jobs->>Jobs: send to LLM provider
  Jobs-->>Jobs: receive TTS audio (base64)
  Jobs->>S3Upload: upload TTS bytes
  S3Upload-->>Jobs: s3:// URI
  Jobs->>Jobs: store URI-only in DB
  Jobs->>Jobs: generate presigned URL
  Jobs->>ChainExecutor: final result with presigned URL
  ChainExecutor->>ChainExecutor: _resolve_presigned_url
  ChainExecutor-->>Client: response with https:// presigned URL

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • ProjectTech4DevAI/kaapi-backend#574: Both PRs touch the shared LLM audio pipeline (backend/app/crud/llm.py: audio serialize_input / update_llm_call_response size handling; backend/app/models/llm/request.py: AudioContent schema/format), so this PR's URL/audio changes directly overlap with that TTS/STT audio model integration work.

Suggested reviewers

  • vprashrex
  • nishika26

Poem

🐰 Audio files take flight on clouds so grand,
From base64 bytes to S3 land,
URLs signed and presently blessed,
Large payloads put to the test!
The pipeline hops with storage delight,
Making audio serving just right. 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 78.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly summarizes the main change, optimizing audio handling through S3 storage integration, which is the core focus across all modified files.
  • Linked Issues Check ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: skipped because no linked issues were found for this pull request.




codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 46.61017% with 63 lines in your changes missing coverage. Please review.

Files with missing lines (patch % | lines missing):

  • backend/app/utils.py: 11.53% | 23 missing ⚠️
  • backend/app/services/llm/jobs.py: 56.86% | 22 missing ⚠️
  • backend/app/crud/llm.py: 31.25% | 11 missing ⚠️
  • backend/app/services/llm/chain/executor.py: 53.33% | 7 missing ⚠️


@Prajna1999 Prajna1999 changed the title Feat/b64 to s3 Voicenote Storage: Optimize audio handling May 7, 2026
@Prajna1999 Prajna1999 linked an issue May 7, 2026 that may be closed by this pull request
@Prajna1999 Prajna1999 self-assigned this May 7, 2026
@Prajna1999 Prajna1999 moved this to In Progress in Kaapi-dev May 7, 2026
@Prajna1999 Prajna1999 requested review from Ayush8923 and vprashrex May 10, 2026 12:24
@Prajna1999 Prajna1999 removed this from Kaapi-dev May 11, 2026
@Prajna1999 Prajna1999 added the enhancement New feature or request label May 12, 2026
@Prajna1999 Prajna1999 moved this to In Review in Kaapi-dev May 12, 2026
@Prajna1999 Prajna1999 linked an issue May 12, 2026 that may be closed by this pull request
@Ayush8923
Collaborator

Please fix the Codecov coverage before merging the PR.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
backend/app/services/llm/chain/executor.py (1)

73-73: ⚡ Quick win

Add a type hint for output.

_resolve_presigned_url is the only changed function here with an untyped parameter. Please annotate it explicitly, even if you keep the runtime isinstance guard.

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/llm/chain/executor.py` at line 73, The
_resolve_presigned_url method currently has an untyped parameter; add an
explicit type hint for output (for example output: dict[str, Any] or
Mapping[str, Any], or a more specific TypedDict/Union if you have a known shape)
and keep the existing return annotation (-> None); update imports from typing as
needed and preserve the existing runtime isinstance guards inside
_resolve_presigned_url to validate the structure at runtime while satisfying the
project's static typing rules.
backend/app/core/storage_utils.py (1)

244-248: 💤 Low value

Extension fallback may mislabel non-WAV audio when MIME type is unknown.

When mime_type is provided but missing from _MIME_TO_EXT (e.g., audio/x-m4a, audio/3gpp), the filename gets .wav while the upload's content_type is set to the actual MIME. This creates a filename/Content-Type mismatch in S3. A safer fallback is to derive the extension from mime_type itself via mimetypes.guess_extension, falling back to wav only when MIME is absent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/core/storage_utils.py` around lines 244 - 248, The filename
extension fallback currently uses _MIME_TO_EXT.get(mime_type or "", "wav") which
mislabels audio when mime_type is present but not in the map; modify the logic
in the block using _MIME_TO_EXT, mime_type, ext, filename and
upload_to_object_store to: if mime_type is truthy try _MIME_TO_EXT[mime_type]
first, else call mimetypes.guess_extension(mime_type) and strip a leading '.' if
present; only fall back to "wav" when mime_type is falsy or guess_extension
returns None/empty; then construct filename and call upload_to_object_store as
before so the file extension matches the upload content_type.
backend/app/services/llm/jobs.py (1)

731-749: 💤 Low value

Minor: TTS upload duplicates the work update_llm_call_response's base64 size-calc already does.

Inside the TTS branch you call base64.b64decode(tts_output.content.value) to upload. A few lines later in the same function the call chain reaches update_llm_call_response, which would also b64decode the same value to compute audio_size_bytes if db_content were still base64. The current code dodges that double-decode (because db_content is replaced with format: "uri"), but it's coincidental. Consider passing the decoded bytes once and threading size_bytes into db_content so both DB record and S3 upload use a single decode and stay consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/llm/jobs.py` around lines 731 - 749, The TTS branch
currently decodes base64 twice implicitly; instead decode once into tts_bytes
(from tts_output.content.value), compute audio_size_bytes = len(tts_bytes), pass
the decoded bytes into upload_audio_bytes_to_s3 (instead of re-decoding later)
and thread audio_size_bytes into the db_content passed to
update_llm_call_response so the DB record uses size_bytes and db_content can be
set to the "uri" form without needing to b64decode again; update references:
tts_output (AudioOutput), tts_bytes, upload_audio_bytes_to_s3,
update_llm_call_response, db_content and llm_call_id to ensure one decode and
consistent size metadata.
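A single-decode version of the flow suggested above might look like the following sketch. prepare_tts_record is a hypothetical helper, not code from the PR; it decodes once and threads the size into the URI-only record so the DB and the S3 upload stay consistent.

```python
import base64

def prepare_tts_record(
    b64_value: str, mime_type: str, s3_uri: str
) -> tuple[bytes, dict]:
    """Decode the TTS payload once, returning the raw bytes for upload
    and a URI-only DB record that already carries the size metadata."""
    tts_bytes = base64.b64decode(b64_value)
    db_content = {
        "type": "audio",
        "format": "uri",
        "mime_type": mime_type,
        "size_bytes": len(tts_bytes),  # computed from the single decode
        "uri": s3_uri,
    }
    return tts_bytes, db_content
```

The caller would upload tts_bytes and persist db_content, so neither update_llm_call_response nor the upload path needs to b64decode the value a second time.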
backend/app/utils.py (1)

609-625: 💤 Low value

Add a [download_audio_bytes] log line and handle empty bodies explicitly.

Minor follow-ups on resolve_audio_url:

  • download_audio_bytes has no log emission on success/failure, so when STT URL ingestion misbehaves there's nothing to correlate. Per coding guidelines, prefix any added log with [download_audio_bytes].
  • An empty audio_bytes (length 0) with no error is currently treated as success and written to a temp file. Worth guarding so the provider doesn't receive a 0-byte file.

As per coding guidelines: "Prefix all log messages with the function name in square brackets".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/utils.py` around lines 609 - 625, Add a log call in
resolve_audio_url that records the result of download_audio_bytes using logger
and prefixed with "[download_audio_bytes]" (include success/failure and any
error), then explicitly check for an empty audio_bytes (len==0) after
download_audio_bytes returns and treat it as an error (return "", "Downloaded
audio is empty" or similar) instead of writing a 0-byte temp file; update
behavior in resolve_audio_url and reference the download_audio_bytes call and
logger to locate where to insert the log and the empty-body guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/services/llm/chain/executor.py`:
- Around line 73-91: The helper _resolve_presigned_url currently mutates the
persisted Output object (changing output.content.uri), which causes the durable
s3:// reference to be overwritten; instead, create and mutate a shallow/deep
copy of the Output (or just copy its content.uri into a new Output/Content
instance), generate/set the presigned URL on that copy (or set a separate
presigned_uri field/local variable) and leave the original Output untouched;
change _resolve_presigned_url to return the copied/modified Output (or presigned
URI) and update callers (e.g., _teardown and the call site that forwards
result.response.response.output into update_llm_chain_status) to use the
returned copy for sending to clients/logs while preserving the stored result for
persistence.

In `@backend/app/services/llm/jobs.py`:
- Around line 574-624: The STT URL branch passes user-controlled
query.input.content.value directly into download_audio_bytes, exposing SSRF/DoS
risk; before calling download_audio_bytes (in the AudioInput branch in
execute_llm_call), validate the URL is HTTPS, resolve and reject
private/internal IPs, enforce max download size and safe redirect behavior (or
else rely only on a hardened download_audio_bytes that performs those checks),
and reject or log and skip on invalid URLs so upload_audio_bytes_to_s3 /
update_llm_call_input never operate on unvalidated remote resources.
- Around line 722-774: The current flow leaves db_content as the full base64
payload when any TTS S3 upload path fails; change the error paths inside the TTS
handling so that on upload failure (upload_audio_bytes_to_s3 returns falsy) or
any exception thrown by get_cloud_storage/b64decode/upload_audio_bytes_to_s3 you
replace db_content with a metadata-only record (e.g.
{"type":"audio","content":{"format":"base64-failed","value":None,"mime_type":tts_output.content.mime_type,"size_bytes":len(tts_output.content.value)
if tts_output and tts_output.content and tts_output.content.value else None}})
and log the failure, or alternatively raise/mark a job error so the caller
(execute_llm_call / update_llm_call_response) will not persist the full base64;
update the try/except and the upload falsy branch around tts_output, db_content,
get_cloud_storage, and upload_audio_bytes_to_s3 to ensure db_content is never
left pointing at the original model_dump containing the base64.

In `@backend/app/utils.py`:
- Around line 595-606: download_audio_bytes currently performs an unsafe
requests.get without URL validation, redirects handling, host resolution checks,
or streaming limits; update download_audio_bytes to (1) enforce HTTPS-only
(reuse validate_callback_url logic or call it), (2) disable automatic redirects
(requests.get(..., allow_redirects=False)) and on any 3xx response validate the
Location by resolving its hostname and re-checking against _is_private_ip before
following, or loop manually following redirects with per-hop re-resolution, (3)
resolve the request hostname to IP(s) and reject
private/loopback/link-local/reserved addresses using the existing _is_private_ip
helper, and (4) stream the response via iter_content with a running byte counter
and an enforced max bytes cap (and/or validate Content-Length before
downloading) to prevent memory/DoS; also ensure any HTTP (non-TLS) scheme is
rejected and surface clear error messages returned by download_audio_bytes so
callers like resolve_audio_url and jobs.py receive safe failures.

---

Nitpick comments:
In `@backend/app/core/storage_utils.py`:
- Around line 244-248: The filename extension fallback currently uses
_MIME_TO_EXT.get(mime_type or "", "wav") which mislabels audio when mime_type is
present but not in the map; modify the logic in the block using _MIME_TO_EXT,
mime_type, ext, filename and upload_to_object_store to: if mime_type is truthy
try _MIME_TO_EXT[mime_type] first, else call
mimetypes.guess_extension(mime_type) and strip a leading '.' if present; only
fall back to "wav" when mime_type is falsy or guess_extension returns
None/empty; then construct filename and call upload_to_object_store as before so
the file extension matches the upload content_type.

In `@backend/app/services/llm/chain/executor.py`:
- Line 73: The _resolve_presigned_url method currently has an untyped parameter;
add an explicit type hint for output (for example output: dict[str, Any] or
Mapping[str, Any], or a more specific TypedDict/Union if you have a known shape)
and keep the existing return annotation (-> None); update imports from typing as
needed and preserve the existing runtime isinstance guards inside
_resolve_presigned_url to validate the structure at runtime while satisfying the
project's static typing rules.

In `@backend/app/services/llm/jobs.py`:
- Around line 731-749: The TTS branch currently decodes base64 twice implicitly;
instead decode once into tts_bytes (from tts_output.content.value), compute
audio_size_bytes = len(tts_bytes), pass the decoded bytes into
upload_audio_bytes_to_s3 (instead of re-decoding later) and thread
audio_size_bytes into the db_content passed to update_llm_call_response so the
DB record uses size_bytes and db_content can be set to the "uri" form without
needing to b64decode again; update references: tts_output (AudioOutput),
tts_bytes, upload_audio_bytes_to_s3, update_llm_call_response, db_content and
llm_call_id to ensure one decode and consistent size metadata.

In `@backend/app/utils.py`:
- Around line 609-625: Add a log call in resolve_audio_url that records the
result of download_audio_bytes using logger and prefixed with
"[download_audio_bytes]" (include success/failure and any error), then
explicitly check for an empty audio_bytes (len==0) after download_audio_bytes
returns and treat it as an error (return "", "Downloaded audio is empty" or
similar) instead of writing a 0-byte temp file; update behavior in
resolve_audio_url and reference the download_audio_bytes call and logger to
locate where to insert the log and the empty-body guard.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dcc84054-a87a-4955-93e8-2d6518521cd4

📥 Commits

Reviewing files that changed from the base of the PR and between 1d02df6 and b3641d8.

📒 Files selected for processing (6)
  • backend/app/core/storage_utils.py
  • backend/app/crud/llm.py
  • backend/app/models/llm/request.py
  • backend/app/services/llm/chain/executor.py
  • backend/app/services/llm/jobs.py
  • backend/app/utils.py

Comment on lines +73 to +91
def _resolve_presigned_url(self, output) -> None:
    """Swap the s3:// URI in content.uri for a presigned URL in-place.

    Non-fatal: clears uri on failure so clients don't receive a raw s3:// address.
    """
    if isinstance(output, AudioOutput) and output.content.uri:
        try:
            with Session(engine) as session:
                storage = get_cloud_storage(session, self._context.project_id)
                output.content.uri = storage.get_signed_url(
                    output.content.uri, expires_in=3600
                )
        except Exception as e:
            logger.warning(
                f"[_resolve_presigned_url] Failed to generate presigned URL: {e} | "
                f"job_id={self._context.job_id}",
                exc_info=True,
            )
            output.content.uri = None


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid mutating the persisted output when presigning.

This helper rewrites the same result.response.response.output object that _teardown() later passes to update_llm_chain_status(...). That means the chain record will persist either a 1-hour signed URL or uri=None on signing failure, instead of the durable s3:// reference. Presign a response/callback copy and leave the stored output untouched.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/llm/chain/executor.py` around lines 73 - 91, The helper
_resolve_presigned_url currently mutates the persisted Output object (changing
output.content.uri), which causes the durable s3:// reference to be overwritten;
instead, create and mutate a shallow/deep copy of the Output (or just copy its
content.uri into a new Output/Content instance), generate/set the presigned URL
on that copy (or set a separate presigned_uri field/local variable) and leave
the original Output untouched; change _resolve_presigned_url to return the
copied/modified Output (or presigned URI) and update callers (e.g., _teardown
and the call site that forwards result.response.response.output into
update_llm_chain_status) to use the returned copy for sending to clients/logs
while preserving the stored result for persistence.
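The copy-then-presign approach described above can be sketched as follows. presigned_copy is a hypothetical helper; storage.get_signed_url mirrors the call shown in the diff, and the original output keeps its durable s3:// reference for persistence.

```python
import copy

def presigned_copy(output, storage, expires_in: int = 3600):
    """Return a copy of `output` whose content.uri is presigned, leaving
    the original (and its durable s3:// reference) untouched."""
    presented = copy.deepcopy(output)
    if getattr(presented, "content", None) and presented.content.uri:
        try:
            presented.content.uri = storage.get_signed_url(
                presented.content.uri, expires_in=expires_in
            )
        except Exception:
            presented.content.uri = None  # never leak raw s3:// to clients
    return presented
```

Callers such as _teardown would send the returned copy to clients and callbacks while passing the untouched original to update_llm_chain_status for persistence.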

Comment on lines +574 to +624
if (
    isinstance(query.input, AudioInput)
    and query.input.content.format in ("base64", "url")
    and llm_call_id
):
    try:
        if query.input.content.format == "url":
            stt_bytes, dl_error = download_audio_bytes(
                query.input.content.value
            )
            if dl_error or not stt_bytes:
                raise ValueError(dl_error or "Empty audio bytes from URL")
            # Rewrite to base64 in-place so the provider resolve path
            # reuses these bytes instead of issuing a second HTTP download.
            query.input.content.value = base64.b64encode(stt_bytes).decode()
            query.input.content.format = "base64"
        else:
            stt_bytes = base64.b64decode(query.input.content.value)

        storage = get_cloud_storage(session, project_id)
        subfolder_path = f"orgs/{organization_id}/{project_id}/audio/stt"
        s3_url = upload_audio_bytes_to_s3(
            storage,
            stt_bytes,
            llm_call_id,
            query.input.content.mime_type,
            subfolder_path,
        )
        if s3_url:
            stt_input_record = json.dumps(
                {
                    "type": "audio",
                    "format": "uri",
                    "mime_type": query.input.content.mime_type,
                    "size_bytes": len(stt_bytes),
                    "uri": s3_url,
                }
            )
            update_llm_call_input(session, llm_call_id, stt_input_record)
            logger.info(
                f"[execute_llm_call] STT audio uploaded to S3 | llm_call_id={llm_call_id}"
            )
        else:
            logger.warning(
                f"[execute_llm_call] STT S3 upload failed | llm_call_id={llm_call_id}"
            )
    except Exception as e:
        logger.warning(
            f"[execute_llm_call] STT S3 upload error, continuing: {e} | llm_call_id={llm_call_id}",
            exc_info=True,
        )

@coderabbitai coderabbitai Bot May 13, 2026


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

STT URL ingestion inherits the SSRF / unbounded-download risk from download_audio_bytes.

This branch passes user-supplied query.input.content.value straight into download_audio_bytes to pre-fetch bytes for S3 upload. Until download_audio_bytes validates the URL (HTTPS-only, non-private IP, redirect handling, max body size — see the comment on backend/app/utils.py), this code path is reachable as an SSRF/DoS vector via the /llm/call API.

Once download_audio_bytes is hardened, this caller benefits automatically; no change is needed here. Noting it for cross-layer traceability.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/llm/jobs.py` around lines 574 - 624, The STT URL branch
passes user-controlled query.input.content.value directly into
download_audio_bytes, exposing SSRF/DoS risk; before calling
download_audio_bytes (in the AudioInput branch in execute_llm_call), validate
the URL is HTTPS, resolve and reject private/internal IPs, enforce max download
size and safe redirect behavior (or else rely only on a hardened
download_audio_bytes that performs those checks), and reject or log and skip on
invalid URLs so upload_audio_bytes_to_s3 / update_llm_call_input never operate
on unvalidated remote resources.

Collaborator Author

@Prajna1999 Prajna1999 May 13, 2026


This is done. You may recheck it once the changes are pushed.


🧩 Analysis chain

I'll verify the hardening in download_audio_bytes before confirming.


🏁 Script executed:

#!/bin/bash
# Check the current implementation of download_audio_bytes and related helpers in utils.py
rg -n "download_audio_bytes|validate.*url|SSRF|private.*ip|max.*size|max_size|redirect|https" backend/app/utils.py -A 5 -B 2

Length of output: 3080


@Prajna1999, after inspecting the current code, the hardening does not appear to be in place yet for download_audio_bytes. Here is what the function currently looks like (Line 595):

def download_audio_bytes(url: str) -> tuple[bytes | None, str | None]:
    """Download audio from a public URL. Returns (bytes, error)."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.content, None

The SSRF/DoS protections are still missing:

  • ❌ No HTTPS-only enforcement
  • ❌ No private/internal IP rejection
  • ❌ No max body size limit
  • ❌ No redirect blocking (allow_redirects defaults to True)

Notably, _is_private_ip and validate_callback_url already exist in the same file and do enforce those protections — but they are only wired to send_callback, not to download_audio_bytes.

The fix would be to reuse that existing infrastructure. For example:

_MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MB

def download_audio_bytes(url: str) -> tuple[bytes | None, str | None]:
    """Download audio from a public URL. Returns (bytes, error)."""
    try:
        # Reuse the same SSRF-safe validation already applied to callback URLs.
        validate_callback_url(url)
    except ValueError as e:
        return None, str(e)

    try:
        with requests.Session() as session:
            session.trust_env = False
            response = session.get(
                url,
                timeout=30,
                allow_redirects=False,
                stream=True,
            )
            response.raise_for_status()
            chunks = []
            total = 0
            for chunk in response.iter_content(chunk_size=65536):
                total += len(chunk)
                if total > _MAX_AUDIO_BYTES:
                    return None, f"Audio file exceeds maximum allowed size ({_MAX_AUDIO_BYTES} bytes)"
                chunks.append(chunk)
            return b"".join(chunks), None
    except requests.RequestException as e:
        return None, str(e)

Could you double-check whether the fix was pushed to the branch? It may not have made it into the latest commit that was reviewed.


🧬 Code Graph Analysis Results
  • File: backend/app/crud/jobs.py, Lines 11-49
    Content:

    class JobCrud:
        def __init__(self, session: Session):
            self.session = session
    
        def create(
            self,
            job_type: JobType,
            trace_id: str | None = None,
            project_id: int | None = None,
        ) -> Job:
            new_job = Job(job_type=job_type, trace_id=trace_id, project_id=project_id)
            self.session.add(new_job)
            self.session.commit()
            self.session.refresh(new_job)
            return new_job
    
        def update(self, job_id: UUID, job_update: JobUpdate) -> Job:
            job = self.session.get(Job, job_id)
            if not job:
                raise ValueError(f"Job not found with the given job_id {job_id}")
    
            update_data = job_update.model_dump(exclude_unset=True)
            for field, value in update_data.items():
                setattr(job, field, value)
    
            job.updated_at = now()
            self.session.add(job)
            self.session.commit()
            self.session.refresh(job)
    
            return job
    
        def get(self, job_id: UUID, project_id: int) -> Job | None:
            job = self.session.get(Job, job_id)
            if job is None:
                return None
            if job.project_id not in (None, project_id):
                return None
            return job
    

    Notes:

    • Provides basic CRUD for Job with create, update, get.
    • Update raises ValueError if job not found.
  • File: backend/app/crud/llm.py, Lines 56-162
    Summary:

    • create_llm_call(session, *, request, job_id, chain_id=None, project_id, organization_id, resolved_config, original_provider) -> LlmCall
      • Creates an LLM call record.
      • Derives input_type and output_type from the completion type (e.g., stt, tts, text/image/pdf/multimodal).
      • Stores config either as a stored reference (config_id + version) or inline blob (resolved_config).
      • Attaches conversation_id if present in the request.
      • Returns the created LlmCall.
      • Logs creation; may raise HTTPException on issues with stored config retrieval or validation.
    • update_llm_call_input(session, llm_call_id, s3_uri) -> None
      • Overwrites the llm_call.input with an S3 URI after STT upload.
      • If the LLM call is not found, logs a warning and returns.
    • update_llm_call_response(session, *, llm_call_id, provider_response_id=None, content=None, usage=None, conversation_id=None) -> LlmCall
      • Updates an LLM call with provider response, usage, and optional conversation_id.
      • If content is audio, computes audio_size_bytes when possible (based on base64-encoded content).
      • Updates updated_at, persists, and returns the updated LlmCall.
        These functions are used by execute_llm_call to persist request/response data.
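    The `audio_size_bytes` computation mentioned above can be derived from the base64 string length alone, without decoding: a string of length n encodes roughly 3n/4 bytes, minus padding. A sketch (the helper name is illustrative, not the actual code):

```python
import base64


def base64_decoded_size(b64: str) -> int:
    """Byte size of the decoded payload, computed from string length alone."""
    # Count trailing '=' padding characters (0, 1, or 2).
    padding = b64.count("=", len(b64) - 2) if len(b64) >= 2 else 0
    return (len(b64) * 3) // 4 - padding


audio_b64 = base64.b64encode(b"\x00" * 1000).decode()
print(base64_decoded_size(audio_b64))  # 1000
```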
  • File: backend/app/services/llm/providers/registry.py, Lines 55-85
    Content:

    def get_llm_provider(
        session: Session, provider_type: str, project_id: int, organization_id: int
    ) -> BaseProvider:
        from app.crud.credentials import get_provider_credential
    
        provider_class = LLMProvider.get_provider_class(provider_type)
    
        # e.g., "openai-native" → "openai", "claude-native" → "claude"
        credential_provider = provider_type.replace("-native", "")
    
        credentials = get_provider_credential(
            session=session,
            provider=credential_provider,
            project_id=project_id,
            org_id=organization_id,
        )
    
        if not credentials:
            raise ValueError(
                f"Credentials for provider '{credential_provider}' not configured for this project."
            )
    
        try:
            client = provider_class.create_client(credentials=credentials)
            return provider_class(client=client)
        except ValueError:
            # Re-raise ValueError for credential/configuration errors
            raise
        except Exception as e:
            logger.error(f"Failed to initialize {provider_type} client: {e}", exc_info=True)
            raise RuntimeError(f"Could not connect to {provider_type} services.")
    

    Notes:

    • Resolves a provider class from type string, fetches credentials, and instantiates a client.
    • Raises ValueError if credentials are missing; otherwise may raise.
  • File: backend/app/core/langfuse/langfuse.py, Lines 177-294
    Summary:

    • Function: observe_llm_execution(session_id=None, credentials=None) -> Callable
    • Returns a decorator that, when wrapping a provider's execute method:
      • Initializes Langfuse client if credentials provided; otherwise logs and proceeds.
      • Creates a Langfuse trace and, if possible, a generation trace for the model.
      • Calls the wrapped provider execute method and records success/failure metrics to Langfuse.
      • On success, ends traces and flushes; on failure, updates traces with error details.
    • Handles exceptions by delegating to the wrapped function (unless the Langfuse client fails to initialize).
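    The wrap-and-record shape described above can be sketched generically; Langfuse trace calls are replaced with a plain recorder list, and all names here are illustrative:

```python
import functools


def observe_execution(recorder: list):
    """Decorator factory: wraps an execute() and records success/failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                result = fn(*args, **kwargs)
                recorder.append(("success", result))  # stand-in for ending the trace
                return result
            except Exception as e:
                recorder.append(("error", str(e)))  # stand-in for trace error update
                raise
        return wrapper
    return decorator


events = []


@observe_execution(events)
def execute(prompt: str) -> str:
    return f"echo: {prompt}"


execute("hi")
print(events)  # [('success', 'echo: hi')]
```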
  • File: backend/app/core/telemetry.py, Lines 259-295
    Content (set_gen_ai_request_attributes):

    def set_gen_ai_request_attributes(
        span: trace.Span,
        *,
        provider: str,
        model: str,
        operation: str,
        organization_id: int | None,
        project_id: int | None,
        params: dict[str, Any] | None = None,
    ) -> None:
        """Set OTel GenAI request attributes on `span` (semantic-convention keys + kaapi ids)."""
        span.set_attribute("gen_ai.system", provider)
        span.set_attribute("gen_ai.provider.name", provider)
        span.set_attribute("gen_ai.operation.name", operation)
        if model:
            span.set_attribute("gen_ai.request.model", model)
        if organization_id is not None:
            span.set_attribute("kaapi.organization_id", organization_id)
            span.set_attribute("gen_ai.request.organization_id", organization_id)
        if project_id is not None:
            span.set_attribute("kaapi.project_id", project_id)
            span.set_attribute("gen_ai.request.project_id", project_id)
    
        params = params or {}
        for attr_key, param_key in (
            ("gen_ai.request.temperature", "temperature"),
            ("gen_ai.request.max_tokens", "max_tokens"),
            ("gen_ai.request.top_p", "top_p"),
            ("gen_ai.request.presence_penalty", "presence_penalty"),
            ("gen_ai.request.frequency_penalty", "frequency_penalty"),
        ):
            if param_key in params:
                span.set_attribute(attr_key, params.get(param_key))
    
        tools = params.get("tools")
        if tools is not None:
            span.set_attribute("gen_ai.request.available_tools", json.dumps(tools))
    

    Content (set_gen_ai_response_attributes):

    def set_gen_ai_response_attributes(
        span: trace.Span, *, response: "LLMCallResponse"
    ) -> None:
        """Set OTel GenAI response attributes (usage, model) on `span`."""
        usage = response.usage
        if usage:
            span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
            span.set_attribute("gen_ai.usage.total_tokens", usage.total_tokens)
            if getattr(usage, "reasoning_tokens", None) is not None:
                span.set_attribute(
                    "gen_ai.usage.output_tokens.reasoning", usage.reasoning_tokens
                )
    
        if response.response and response.response.model:
            span.set_attribute("gen_ai.response.model", response.response.model)
    
  • File: backend/app/models/llm/request.py, Lines 424-458
    Summary:

    • Class LLMCallRequest(SQLModel)
    • Fields:
      • query: QueryParams
      • config: LLMCallConfig
      • callback_url: HttpUrl | None
      • include_provider_raw_response: bool
      • request_metadata: dict[str, Any] | None
    • Purpose:
      • Represents an LLM call request, where config can be a stored config (id + version) or an ad-hoc blob.
    • Notes:
      • Used by execute_job and execute_llm_call to drive processing and callbacks.
  • File: backend/app/api/routes/llm.py, Lines 53-108
    Summary:

    • Endpoint llm_call(_current_user, session, request)
    • Determines project_id and organization_id from context.
    • Validates callback URL if provided.
    • Calls start_job(...) to schedule background LLM call.
    • Fetches job info and returns immediate APIResponse with job details (possibly including callback messaging).
    • Purpose:
      • Initiates an LLM call as a background job and returns status-like info for polling or callback delivery.
  • File: backend/app/celery/utils.py, Lines 31-38 and 41-56
    Summary:

    • start_llm_job(project_id, job_id, trace_id="N/A", **kwargs)
    • start_llm_chain_job(project_id, job_id, trace_id="N/A", **kwargs)
    • Both enqueue Celery tasks (run_llm_job / run_llm_chain_job) with trace context
    • Log task start information for observability
    • Purpose:
      • Provide standard helpers to enqueue LLM and LLM chain jobs via Celery with tracing context.
  • File: backend/app/services/llm/chain/chain.py, Lines 82-99
    Summary:

    • class ChainBlock.execute(self, query)
    • Executes a single block by calling execute_llm_call with the block's config and context
    • Returns a BlockResult
    • Notes:
    • Part of the chain execution flow; used by LLMChain to run blocks sequentially.
  • File: backend/app/services/llm/chain/chain.py, Lines 109-137
    Summary:

    • class LLMChain
    • execute(self, query, on_block_completed=None): runs blocks sequentially
    • For each block, runs block.execute(current_query), aggregates results, and passes output to next block
    • Returns final BlockResult
    • Notes:
    • Central orchestration for chain-based LLM tasks.
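    The sequential block orchestration summarized above can be sketched in a few lines; `Block`/`Chain` here are simplified stand-ins for `ChainBlock`/`LLMChain`, with strings instead of full query/result objects:

```python
from typing import Callable


class Block:
    def __init__(self, name: str, fn: Callable[[str], str]):
        self.name, self.fn = name, fn

    def execute(self, query: str) -> str:
        return self.fn(query)


class Chain:
    def __init__(self, blocks: list[Block]):
        self.blocks = blocks

    def execute(self, query: str, on_block_completed=None) -> str:
        current = query
        for block in self.blocks:
            current = block.execute(current)  # each block's output feeds the next
            if on_block_completed:
                on_block_completed(block.name, current)
        return current


chain = Chain([Block("upper", str.upper), Block("exclaim", lambda s: s + "!")])
print(chain.execute("hello"))  # HELLO!
```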
  • File: backend/app/services/llm/guardrails.py, Lines 82-169
    Summary:

    • list_validators_config(...) -> tuple[list[dict], list[dict]]
    • run_guardrails_validation(...)
    • Purpose:
      • Fetch guardrail validator configs and run validations on inputs/outputs
      • Returns results used by apply_input_guardrails and apply_output_guardrails
  • File: backend/app/core/storage_utils.py, Lines 225-248
    Summary:

    • upload_audio_bytes_to_s3(...)
    • Uploads audio bytes to S3-like storage, returns s3:// URI on success
    • Parameters include storage, audio_bytes, call_id, mime_type, prefix
    • Returns s3:// URI or None on failure
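    The returned s3:// URI presumably combines the bucket, a prefix, the call id, and an extension derived from the MIME type. A sketch of that key layout; the prefix convention and extension mapping are assumptions, not the actual implementation:

```python
import uuid

# Illustrative MIME-to-extension mapping; the real helper may differ.
_EXT_BY_MIME = {"audio/mpeg": "mp3", "audio/wav": "wav", "audio/ogg": "ogg"}


def build_audio_s3_uri(bucket: str, prefix: str, call_id: str, mime_type: str) -> str:
    ext = _EXT_BY_MIME.get(mime_type, "bin")
    key = f"{prefix}/{call_id}.{ext}"
    return f"s3://{bucket}/{key}"


uri = build_audio_s3_uri("kaapi-audio", "llm-audio", str(uuid.uuid4()), "audio/mpeg")
print(uri)
```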
  • File: backend/app/core/storage.py (referenced by get_cloud_storage) and related storage usage
    Summary:

    • get_cloud_storage(session, project_id) -> CloudStorage
    • Retrieves project to determine storage path, creates AmazonCloudStorage
    • Raises HTTPException if project invalid or storage creation fails
  • File: backend/app/services/llm/chain/executor.py, Lines 26-225
    Summary:

    • class ChainExecutor
    • run(self): runs the chain via _setup, chain.execute, and _teardown
    • _setup: marks parent job/chain as processing/running and stores webhook secret
    • _resolve_presigned_url(output): converts s3:// URIs to presigned URLs for outputs
    • _teardown(result): on success, builds callback data, calls webhook, updates statuses
    • _handle_error(error): builds failure response, calls webhook, updates statuses
    • _on_block_completed: aggregates usage; may trigger intermediate callbacks
    • _send_intermediate_callback: sends partial results if enabled
    • _handle_unexpected_error: returns a generic failure path
      Notes:
    • Encapsulates end-to-end lifecycle of a chain run, including callbacks and storage URL handling.
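    A method like `_resolve_presigned_url` has to split an s3:// URI back into bucket and key before a presigned URL can be requested. A sketch of that parsing step (the real method likely delegates to the storage layer):

```python
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an s3:// URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = parse_s3_uri("s3://kaapi-audio/llm-audio/abc.mp3")
print(bucket, key)  # kaapi-audio llm-audio/abc.mp3
```

    With bucket and key in hand, the storage client can issue a short-lived download link (e.g., boto3's `generate_presigned_url` for S3).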

Comment thread backend/app/services/llm/jobs.py
Comment thread backend/app/utils.py
Comment on lines +595 to +606
    def download_audio_bytes(url: str) -> tuple[bytes | None, str | None]:
        """Download audio from a public URL. Returns (bytes, error)."""
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content, None
        except requests.exceptions.Timeout:
            return None, f"Timed out downloading audio from URL: {url}"
        except requests.exceptions.HTTPError as e:
            return None, f"HTTP {e.response.status_code} downloading audio from URL: {url}"
        except Exception as e:
            return None, f"Failed to download audio from URL: {str(e)}"

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Critical SSRF risk — user-supplied audio URLs are fetched without validation.

download_audio_bytes is invoked for AudioContent payloads coming directly from API callers (via resolve_audio_url and from jobs.py for STT URL inputs). With requests.get here:

  • The URL is never validated, so a caller can point value at internal IPs (e.g., 127.0.0.1, 169.254.169.254, RFC1918 ranges), achieving SSRF against the cloud metadata service or internal control planes.
  • allow_redirects defaults to True, so even an HTTPS public host can redirect to an internal address; per-hop validation is required.
  • The full body is buffered into memory with response.content and no Content-Length / streaming cap, enabling a memory-exhaustion DoS via a large/zip-bomb-style response.
  • HTTP (non-TLS) URLs are not rejected.

This file already implements validate_callback_url and _is_private_ip for the outbound callback path — the same protections should apply here. At minimum:

  1. Reject non-HTTPS schemes.
  2. Resolve the hostname and reject private/loopback/link-local/reserved addresses (re-resolve after each redirect, or disable redirects).
  3. Enforce an upper bound on downloaded bytes (iter_content with a running total, or check Content-Length).
🛡️ Sketch of a hardened implementation
-def download_audio_bytes(url: str) -> tuple[bytes | None, str | None]:
-    """Download audio from a public URL. Returns (bytes, error)."""
-    try:
-        response = requests.get(url, timeout=30)
-        response.raise_for_status()
-        return response.content, None
-    except requests.exceptions.Timeout:
-        return None, f"Timed out downloading audio from URL: {url}"
-    except requests.exceptions.HTTPError as e:
-        return None, f"HTTP {e.response.status_code} downloading audio from URL: {url}"
-    except Exception as e:
-        return None, f"Failed to download audio from URL: {str(e)}"
+MAX_AUDIO_DOWNLOAD_BYTES = 50 * 1024 * 1024  # 50 MiB
+
+def download_audio_bytes(url: str) -> tuple[bytes | None, str | None]:
+    """Download audio from a public URL with SSRF + size protection."""
+    try:
+        validate_callback_url(url)  # reuses SSRF guard (HTTPS + IP checks)
+    except ValueError as ve:
+        return None, f"Disallowed audio URL: {ve}"
+    try:
+        with requests.get(
+            url, timeout=30, stream=True, allow_redirects=False
+        ) as response:
+            response.raise_for_status()
+            content_length = response.headers.get("Content-Length")
+            if content_length and int(content_length) > MAX_AUDIO_DOWNLOAD_BYTES:
+                return None, "Audio file exceeds maximum allowed size"
+            buf = bytearray()
+            for chunk in response.iter_content(chunk_size=64 * 1024):
+                buf.extend(chunk)
+                if len(buf) > MAX_AUDIO_DOWNLOAD_BYTES:
+                    return None, "Audio file exceeds maximum allowed size"
+            return bytes(buf), None
+    except requests.exceptions.Timeout:
+        return None, f"Timed out downloading audio from URL: {url}"
+    except requests.exceptions.HTTPError as e:
+        return None, f"HTTP {e.response.status_code} downloading audio from URL: {url}"
+    except Exception as e:
+        return None, f"Failed to download audio from URL: {str(e)}"

Note: validate_callback_url currently enforces HTTPS-only — confirm that requirement matches your STT customer contract; if HTTP must be allowed, fork the check to keep the IP-range protection but relax the scheme.
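A private-address guard of the kind the review attributes to `_is_private_ip` can be built on the stdlib ipaddress module; a minimal sketch (the resolution step is simplified, and per-redirect re-resolution is still the caller's responsibility):

```python
import ipaddress
import socket


def is_private_ip(host: str) -> bool:
    """True if `host` resolves to a private/loopback/link-local/reserved address."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # fail closed on unresolvable hosts
    for *_, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return True
    return False


print(is_private_ip("127.0.0.1"))        # True
print(is_private_ip("169.254.169.254"))  # True (cloud metadata endpoint)
```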

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/utils.py` around lines 595 - 606, download_audio_bytes currently
performs an unsafe requests.get without URL validation, redirects handling, host
resolution checks, or streaming limits; update download_audio_bytes to (1)
enforce HTTPS-only (reuse validate_callback_url logic or call it), (2) disable
automatic redirects (requests.get(..., allow_redirects=False)) and on any 3xx
response validate the Location by resolving its hostname and re-checking against
_is_private_ip before following, or loop manually following redirects with
per-hop re-resolution, (3) resolve the request hostname to IP(s) and reject
private/loopback/link-local/reserved addresses using the existing _is_private_ip
helper, and (4) stream the response via iter_content with a running byte counter
and an enforced max bytes cap (and/or validate Content-Length before
downloading) to prevent memory/DoS; also ensure any HTTP (non-TLS) scheme is
rejected and surface clear error messages returned by download_audio_bytes so
callers like resolve_audio_url and jobs.py receive safe failures.


Labels

enhancement (New feature or request), ready-for-review

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Voicenote Storage: Optimize audio handling

3 participants